YAWN: A Semantically Annotated Wikipedia XML Corpus

Authors

  • Ralf Schenkel
  • Fabian M. Suchanek
  • Gjergji Kasneci
Abstract

This paper presents YAWN, a system that converts the widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms that annotate pages and links with concepts from the WordNet thesaurus. The annotation process exploits Wikipedia's category information, a high-quality, manually assigned source; extracts additional information from lists; and uses template invocations with named parameters. We give examples of how such annotations can be exploited for high-precision queries.
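As a rough illustration of the idea of category-driven annotation, the following minimal Python sketch wraps a page in concept tags derived from its category names. It is not YAWN's algorithm: the `CONCEPTS` dictionary is a tiny hypothetical stand-in for a WordNet lookup, the word-matching heuristic is an assumption, and the `article`, `title`, and `body` element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a WordNet lookup (assumption, not YAWN's data):
# maps a word that may occur in a category name to a concept tag.
CONCEPTS = {"physicists": "physicist", "scientists": "scientist", "cities": "city"}


def concept_for_category(category):
    """Return the concept for the first known word in a category name, else None."""
    for word in category.lower().split():
        if word in CONCEPTS:
            return CONCEPTS[word]
    return None


def annotate_page(title, text, categories):
    """Serialize a page as XML, nested inside one tag per distinct concept found."""
    concepts = []
    for category in categories:
        concept = concept_for_category(category)
        if concept and concept not in concepts:
            concepts.append(concept)

    root = ET.Element("article")
    node = root
    for concept in concepts:
        # Each concept becomes a wrapping element around the page content.
        node = ET.SubElement(node, concept)
    ET.SubElement(node, "title").text = title
    ET.SubElement(node, "body").text = text
    return ET.tostring(root, encoding="unicode")


xml = annotate_page("Albert Einstein", "Physicist.", ["German physicists", "1879 births"])
print(xml)
```

With tags like these in place, a structural query such as the ElementTree path `.//physicist/title` selects only the titles of pages annotated as physicists, which hints at how semantic tags can support high-precision queries.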

Similar resources

Semantically Annotated Snapshot of the English Wikipedia

This paper describes SW1, the first version of a semantically annotated snapshot of the English Wikipedia. In recent years Wikipedia has become a valuable resource for both the Natural Language Processing (NLP) community and the Information Retrieval (IR) community. Although NLP technology for processing Wikipedia already exists, not all researchers and developers have the computational resourc...


Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia

The hyperlink structure of Wikipedia constitutes a key resource for many Natural Language Processing tasks and applications, as it provides several million semantic annotations of entities in context. Yet only a small fraction of mentions across the entire Wikipedia corpus is linked. In this paper we present the automatic construction and evaluation of a Semantically Enriched Wikipedia (SEW) in...


Learning the semantics of Wikipedia hyperlinks

I claim that hyperlinks in Wikipedia entries often correspond to semantic relationships between the concepts described by the entries. This bachelor’s thesis discusses supervised methods to automatically identify new links that correspond to a given relation (hypernymy or hyponymy). Training data is collected by mapping Wikipedia articles to WordNet synsets and then marking links where a relation bet...


sloWNet: construction and corpus annotation

This paper presents a wordnet for Slovene which was created semi-automatically with a combination of approaches and multilingual resources, in particular a bilingual dictionary, a parallel corpus and Wikipedia. Analysis of the results shows that the dictionary approach yields a good core wordnet but requires substantial manual editing due to a lack of automatic word-sense disambiguation. This w...


WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles

This paper presents WikiCoref, an English corpus annotated for anaphoric relations, where all documents are from the English version of Wikipedia. Our annotation scheme follows the one of OntoNotes with a few disparities. We annotated each markable with coreference type, mention type and the equivalent Freebase topic. Since most similar annotation efforts concentrate on very specific types of w...


Publication date: 2007